gghist <- function(var, var_name){
  ggplot(data = data, aes(var)) +
    geom_histogram(bins = 30, aes(y = stat(width * density)),
                   fill = 'white', col = 'black') +
    xlab(var_name) + ylab("Relative Frequency") +
    scale_y_continuous(labels = percent_format()) +
    ggtitle(paste(var_name, "Distribution")) +
    theme_bw()
}
gghist.group <- function(var, var_name){
  ggplot(data, aes(var, group = ArtistGen)) +
    geom_histogram(bins = 30, aes(y = stat(width * density)),
                   fill = 'white', col = 'black') +
    xlab(var_name) + ylab('Relative Frequency') +
    scale_y_continuous(labels = percent_format()) +
    ggtitle(paste(var_name, "by Generation")) +
    theme_bw() +
    facet_wrap(~ArtistGen)
}
ggbox.group <- function(var, var_name){
  ggplot(data = data, aes(x = var, y = factor(ArtistGen))) +
    geom_boxplot(fill = gen.pal, col = "black",
                 outlier.color = 'red') +
    xlab(var_name) + ylab('K-pop Generation') +
    ggtitle(paste(var_name, "by Generation")) +
    theme_bw()
}
audio <- fread("audiofeatures_clean.csv")
artists <- fread("artist_df.csv")[, -1]
data <- merge(artists, audio, by.x = 'ID', by.y = 'artist_uri')
data <- na.omit(data) # remove NAs (there were less than 5 rows)
data <- data %>%
  mutate(Generation = floor(Generation),
         release_date = as.POSIXct(release_date, format = "%m/%d/%y"),
         mode = as.factor(mode),
         key = factor(key),
         time_signature = as.factor(time_signature),
         duration_mmss = format(as.POSIXct(Sys.Date()) + duration_ms/1000, "%M:%S"),
         duration_ms = duration_ms/1000) %>%
  select(-artist) %>%
  rename(Artist_uri = ID,
         ArtistType = Type,
         ArtistGender = Gender,
         ArtistGen = Generation,
         ArtistDebut = DebutYear,
         duration = duration_ms)
#albums <- unique(select(data, Artist, album))
# head(data, n=5)
Albums often include intro and outro tracks that serve as aesthetic pieces tying the whole album together as one artistic work. However, these tracks do not help characterize the K-pop genre: they act as 'filler' rather than candidates for active promotion in the commercial music market.
We therefore remove these types of tracks from the dataset, as they are the source of the high-end skew in instrumentalness and speechiness and the low-end skew in song duration.
Specifically, we remove songs whose names contain the words intro, outro, or interlude; songs longer than 10 minutes or shorter than about a minute; and songs of roughly 2 minutes or less that are outliers in instrumentalness or speechiness.
This does not remove every single 'filler' song, but it removes most of them.
remove <- data %>%
  filter(str_detect(song_name, "intro") |
           str_detect(song_name, "outro") |
           str_detect(song_name, "interlude") |
           duration >= 600 | duration <= 60 |
           (duration <= 120.5 &
              (instrumentalness >= 0.50 | speechiness >= 0.40 | speechiness == 0)))
# short <- filter(data, duration <= 2)
# nrow(short)
nrow(remove)
## [1] 380
data <- data[!data$song_uri %in% remove$song_uri, ]
#fwrite(data, "kpopdata.csv")
Let's look at a summary of the data.
summary(data)
## Artist_uri Artist ArtistType ArtistGender
## Length:12012 Length:12012 Length:12012 Length:12012
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## ArtistGen ArtistDebut song_name song_uri
## Min. :1.000 Min. :1992 Length:12012 Length:12012
## 1st Qu.:2.000 1st Qu.:2007 Class :character Class :character
## Median :2.000 Median :2010 Mode :character Mode :character
## Mean :2.262 Mean :2009
## 3rd Qu.:3.000 3rd Qu.:2015
## Max. :4.000 Max. :2020
##
## album album_uri release_date
## Length:12012 Length:12012 Min. :1992-03-23 00:00:00
## Class :character Class :character 1st Qu.:2012-01-02 12:00:00
## Mode :character Mode :character Median :2015-09-10 00:00:00
## Mean :2014-02-22 20:34:05
## 3rd Qu.:2018-05-08 00:00:00
## Max. :2021-01-20 00:00:00
##
## popularity duration acousticness danceability
## Min. : 0.00 Min. : 60.74 Min. :0.0000038 Min. :0.0733
## 1st Qu.:11.00 1st Qu.:198.65 1st Qu.:0.0332000 1st Qu.:0.5920
## Median :23.00 Median :214.17 Median :0.1100000 Median :0.6740
## Mean :25.88 Mean :217.83 Mean :0.1955260 Mean :0.6585
## 3rd Qu.:39.00 3rd Qu.:233.99 3rd Qu.:0.2820000 3rd Qu.:0.7400
## Max. :94.00 Max. :555.52 Max. :0.9900000 Max. :0.9770
##
## energy instrumentalness key liveness
## Min. :0.000923 Min. :0.0000000 0 :1482 Min. :0.0116
## 1st Qu.:0.681750 1st Qu.:0.0000000 1 :1384 1st Qu.:0.0932
## Median :0.813000 Median :0.0000000 7 :1247 Median :0.1380
## Mean :0.769973 Mean :0.0080314 11 :1120 Mean :0.1925
## 3rd Qu.:0.896000 3rd Qu.:0.0000013 5 :1089 3rd Qu.:0.2740
## Max. :0.999000 Max. :0.9510000 6 :1008 Max. :0.9830
## (Other):4682
## loudness mode speechiness tempo time_signature
## Min. :-27.040 0:4673 Min. :0.02200 Min. : 47.6 0: 0
## 1st Qu.: -5.460 1:7339 1st Qu.:0.03860 1st Qu.:102.0 1: 11
## Median : -4.191 Median :0.05510 Median :121.9 3: 270
## Mean : -4.517 Mean :0.07739 Mean :121.5 4:11688
## 3rd Qu.: -3.163 3rd Qu.:0.09133 3rd Qu.:135.0 5: 43
## Max. : 0.394 Max. :0.95700 Max. :218.1
##
## valence duration_mmss
## Min. :0.0349 Length:12012
## 1st Qu.:0.4350 Class :character
## Median :0.6160 Mode :character
## Mean :0.5964
## 3rd Qu.:0.7680
## Max. :0.9840
##
release date distribution
#gghist(data$release_date, 'release date')
The popularity measure from Spotify is defined as: > The popularity of the track. The value will be between 0 and 100, with 100 being the most popular. The popularity is calculated by algorithm and is based, in the most part, on the total number of plays the track has had and how recent those plays are. Generally speaking, songs that are being played a lot now will have a higher popularity than songs that were played a lot in the past. Duplicate tracks (e.g. the same track from a single and an album) are rated independently. Artist and album popularity is derived mathematically from track popularity. Note that the popularity value may lag actual popularity by a few days: the value is not updated in real time.
popularity distribution
gghist(data$popularity, 'Popularity')
The mode of the popularity scores in the dataset is around 10, and the majority of songs score below 50. In other words, very few songs in the dataset have very high popularity scores. This is appropriate, since Spotify's popularity score weights recent plays more heavily, and the vast majority of songs in the dataset are older, released prior to 2018. The share of the data that would be considered 'recent' is much smaller, so the distribution we are seeing is to be expected.
Furthermore, for each artist I collected their ENTIRE discography. Naturally, not every song on an artist's album will be popular, even for a very popular artist, so it makes sense that a relatively small portion of the songs in the dataset are extremely popular.
popularity by generation
gghist.group(data$popularity, "Popularity")
From the histograms we can see that the 1st and 2nd generation distributions are heavily right skewed, whereas the 3rd and 4th generation distributions both appear roughly normal.
by generation
ggbox.group(data$popularity, "Popularity")
Again, due to the time sensitivity of Spotify's popularity score, the center of the data moves higher with each successive (newer) generation of K-pop.
Despite first-generation K-pop groups having a lower overall median popularity, there are clearly still songs from that generation with high popularity today. However, a quick look at the first-generation songs with popularity above 50 shows that many were released well past their classified generation, thanks to the longevity of these artists' careers.
data %>%
  filter(ArtistGen == 1 & popularity >= 50) %>%
  select(Artist, ArtistDebut, song_name, release_date, popularity)
## Artist ArtistDebut song_name release_date
## 1: SHINHWA 1998 perfect man 2002-03-01
## 2: J.Y. Park 1994 when we disco (duet with sunmi) 2020-08-12
## 3: J.Y. Park 1994 fever 2019-12-01
## 4: PSY 2001 gangnam style (강남스타일) 2012-01-01
## 5: PSY 2001 daddy 2015-12-01
## 6: PSY 2001 new face 2017-05-10
## 7: PSY 2001 hangover 2014-06-09
## 8: BoA 2000 better 2020-12-01
## 9: BoA 2000 masayume chasing 2014-09-03
## 10: BoA 2000 only one 2012-07-22
## 11: BoA 2000 no.1 2002-01-04
## 12: Rain 2002 switch to me (duet with jyp) 2020-12-31
## 13: Rain 2002 summer hate (zico feat. rain) 2020-07-01
## popularity
## 1: 55
## 2: 62
## 3: 51
## 4: 74
## 5: 57
## 6: 54
## 7: 50
## 8: 66
## 9: 57
## 10: 56
## 11: 51
## 12: 62
## 13: 59
The two songs that have truly stood the test of time are BoA's no.1 and SHINHWA's perfect man, both released in 2002.
Duration is the track length measured by Spotify in milliseconds. However, I have converted the measurement to seconds (duration in milliseconds / 1000) in order to make the analysis more interpretable.
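The conversion can be sketched in base R (a standalone example; `214170` is an arbitrary sample value, and the `as.POSIXct` anchor date is only there to borrow `format()`'s clock formatting, mirroring the `duration_mmss` trick used in the cleaning step):

```r
# Convert a Spotify track duration from milliseconds to seconds,
# then format it as mm:ss for readability.
duration_ms  <- 214170                 # arbitrary example value
duration_sec <- duration_ms / 1000     # 214.17 seconds

# format() needs a date-time, so anchor the seconds to an arbitrary
# origin and keep only the minute and second fields.
duration_mmss <- format(as.POSIXct(duration_sec, origin = "1970-01-01", tz = "UTC"),
                        "%M:%S")
duration_mmss                          # "03:34"
```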
duration distribution
gghist(data$duration, 'Duration (seconds)')
As you can see, the majority of tracks are between 2.5 and 5 minutes long, which is typical since most pop songs run 2-5 minutes. Upon investigation, two significantly long songs were removed:
* Turbo's non-stop summer dj remix which is 22 minutes long. Likely a track used to play at clubs or party events.
* Orange Caramel's magic - origin at 13 minutes.
The rest of the songs are 9 minutes or less.
duration by generation
gghist.group(data$duration, "Duration (seconds)")
by generation
ggbox.group(data$duration, "Duration (Seconds)")
Overall, the distributions look fairly similar, but first-generation songs tend to run longer than those of the other generations. The boxplots show that the median song length decreases as the generation increases, so it seems songs are becoming shorter over time.
A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
acousticness distribution
gghist(data$acousticness, 'Acousticness')
#gghist(sqrt(data$acousticness), 'acousticness square root')
Overall, K-pop music is heavily influenced by styles like EDM, hip hop, electronic, and tropical house [find a resource to make a conclusive statement about this?]. The backing tracks therefore use many electronic sounds and little acoustic instrumentation, so it is appropriate that the acousticness feature is skewed to the right, with the majority of tracks below 0.25.
acousticness by generation
gghist.group(data$acousticness, "Acousticness")
Generally, the shape of the distribution of Acousticness for each generation is roughly similar: right skewed. It appears that the first generation has typically lower levels of acousticness than the other generations.
by generation
ggbox.group(data$acousticness, "Acousticness")
Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
danceability distribution
gghist(data$danceability, 'danceability')
K-pop is well known for its focus on eye-catching choreography to accompany each song [provide some resources describing this?]. It is therefore understandable that the majority of tracks sit above 0.50 on danceability, with low-frequency tails below. The distribution looks roughly normal with a negative skew.
Danceability by generation
gghist.group(data$danceability, "Danceability")
Overall, each generation has a similar distribution, all with centers just below 0.75. The boxplots make it easier to compare the centers of the distributions.
by generation
ggbox.group(data$danceability, "Danceability")
While all generations have comparably high levels of danceability, the first generation has the highest typical level, with the 4th generation right behind it. The 1st generation was highly influenced by American boy bands and hip hop, and in that style there were many popular hip-hop and dance idol groups.
Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.
energy distribution
gghist(data$energy, 'Energy')
As expected for music with high danceability, energy levels are very high for K-pop songs, with a typical value around 0.85. K-pop is especially known for upbeat, high-energy tracks, so it is no surprise that the distribution has a high center with a tail to the left.
energy by generation
gghist.group(data$energy, "Energy")
The distributions all have a similar shape across generations. However, as the generations increase, the tails seem to get shorter and there is less variation.
by generation
ggbox.group(data$energy, "Energy")
The boxplots show that the centers for all generations are about the same, with the 4th generation having slightly higher levels of energy than the rest.
Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.
instrumentalness distribution
gghist(data$instrumentalness, 'instrumentalness')
gghist(log(data$instrumentalness), 'instrumentalness (log transform)')
## Warning: Removed 8841 rows containing non-finite values (stat_bin).
Note: the log-transformed histogram omits songs where instrumentalness equals 0. In other words, the log-scale plot should be interpreted as showing only the nonzero observations.
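The warning above arises because `log(0)` evaluates to `-Inf`, which ggplot2 drops as a non-finite value. A quick base R illustration, with a toy vector standing in for instrumentalness:

```r
x <- c(0, 0, 0.25, 0.951)   # toy instrumentalness values (two exact zeros)
log_x <- log(x)             # -Inf for the zeros, finite otherwise

# these are the rows ggplot2 would report as "non-finite" and drop
sum(!is.finite(log_x))      # 2
```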
instrumentalness by generation
gghist.group(data$instrumentalness, "Instrumentalness")
gghist.group(log(data$instrumentalness), "Instrumentalness (Log Transform)")
## Warning: Removed 8841 rows containing non-finite values (stat_bin).
by generation
ggbox.group(log(data$instrumentalness), 'Instrumentalness (Log Transform)')
## Warning: Removed 8841 rows containing non-finite values (stat_boxplot).
The key the track is in. Integers map to pitches using standard Pitch Class notation, e.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on.
# pitch class names, with the flats spelled out
pitch.class <- c("C", "C♯/D-flat", "D", "D♯/E-flat", "E", "F",
                 "F♯/G-flat", "G", "G♯/A-flat", "A", "A♯/B-flat", "B")
names(pitch.class) <- 0:11
pitch.class
## 0 1 2 3 4 5
## "C" "C♯/D-flat" "D" "D♯/E-flat" "E" "F"
## 6 7 8 9 10 11
## "F♯/G-flat" "G" "G♯/A-flat" "A" "A♯/B-flat" "B"
key distribution (categorical, 0-11)
ggplot(data, aes(key)) +
  geom_bar() +
  theme_minimal() +
  scale_x_discrete(labels = pitch.class)
keys <- dcast(data.frame(table(data %>% select(key, ArtistGen))),
              key ~ ArtistGen, value.var = "Freq")
colnames(keys) <- c("key", "gen1", "gen2", "gen3", "gen4")
#keys <- mutate(keys, gen1 = gen1/sum(gen1), gen2 = gen2/sum(gen2), gen3 = gen3/sum(gen3), gen4 = gen4/sum(gen4))
#keys <- reshape(keys, idvar = "key", timevar = "ArtistGen", direction = "wide")
ggplot(keys) +
  # remove axes and superfluous grids
  theme_classic() +
  theme(axis.title = element_blank(),
        axis.ticks.y = element_blank(),
        axis.line = element_blank()) +
  # add a dummy point for scaling purposes
  geom_point(aes(x = 12, y = key),
             size = 0, col = "white") +
  # add the horizontal guide lines
  geom_hline(yintercept = 1:12, col = "grey80") +
  # add point for gen 1
  geom_point(aes(x = gen1, y = key),
             size = 8, col = gen.pal[1]) +
  # add point for gen 2
  geom_point(aes(x = gen2, y = key),
             size = 8, col = gen.pal[2]) +
  # add point for gen 3
  geom_point(aes(x = gen3, y = key),
             size = 8, col = gen.pal[3]) +
  # add point for gen 4
  geom_point(aes(x = gen4, y = key),
             size = 8, col = gen.pal[4])
ggplot(data, aes(ArtistGen, y = ..count.. / sum(..count..))) +
  geom_bar(aes(fill = key), position = "dodge") +
  xlab('Generation') + ylab('Relative Frequency') +
  theme_bw() +
  facet_grid(ArtistGen ~ .)
Surprise, surprise: the majority of songs use the key of C.
Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.
During the data collection and cleaning process, I intentionally omitted live performance recordings from the dataset, since they are simply duplicates of the originally released commercial track.
Any song flagged as live here is therefore just the Spotify algorithm's detection, not an actual live performance. This variable will be removed from the analysis.
liveness distribution
gghist(data$liveness, 'liveness')
gghist.group(data$liveness, 'Liveness')
by generation
ggbox.group(data$liveness, "Liveness")
All generations have roughly the same liveness distribution and roughly the same centers, with the exception of generation 4. However, since I removed all live recordings of songs, any detected liveness comes from the Spotify API. The fact that the API still picks up high liveness values suggests the algorithm may not be the most accurate detector of some of these audio features.
The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB.
loudness distribution
gghist(data$loudness, 'loudness')
gghist.group(data$loudness, 'Loudness')
by generation
ggbox.group(data$loudness, "Loudness")
Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.
mode distribution (also categorical)
ggplot(data, aes(mode)) +
  geom_bar(aes(fill = mode)) +
  scale_fill_manual("legend", values = c("0" = deux.pal[1], "1" = deux.pal[2])) +
  #scale_y_continuous(labels = percent()) +
  ylab("Frequency") + xlab("Musical Mode") +
  ggtitle("Musical Mode") +
  theme_minimal()
ggplot(data, aes(ArtistGen, y = ..count.. / sum(..count..))) +
  geom_bar(aes(fill = mode), position = "fill") +
  scale_fill_manual("legend", values = c("0" = deux.pal[1], "1" = deux.pal[2])) +
  xlab('Generation') + ylab('Relative Frequency') +
  ggtitle("Musical Mode by Generation") +
  #scale_y_continuous(labels = percent_format()) +
  theme_minimal() #+
  #facet_wrap(~ArtistGen)
Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.
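Spotify's 0.33 / 0.66 cutoffs can be applied directly with `cut()`; a sketch with toy values (the bin labels are my own shorthand, not Spotify's):

```r
speech <- c(0.022, 0.055, 0.40, 0.80)   # toy speechiness values
speech_class <- cut(speech,
                    breaks = c(0, 0.33, 0.66, 1),
                    labels = c("music", "music + speech", "spoken word"))
as.character(speech_class)
# "music" "music" "music + speech" "spoken word"
```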
speechiness distribution
gghist(data$speechiness, 'speechiness')
gghist(log(data$speechiness), 'speechiness (log transform)')
Because this dataset contains only songs, with no podcasts or book readings, we would expect this heavily skewed distribution, with the majority of the data under 0.25 in speechiness. However, it is very difficult to see the details of the long right tail, so we also investigate the distribution on the log scale. On the log scale the data is still strongly right skewed, but we can see more detail in the right tail for speechiness values above 0.25 (on the original scale). There is a gradual decline in songs with high levels of speechiness, with a clear drop-off for log values between -1 and 0. These are likely oddities of the data: perhaps there are some spoken tracks between songs on albums, serving a similar artistic purpose to instrumental intro/outro tracks.
gghist.group(data$speechiness, "Speechiness")
gghist.group(log(data$speechiness), "Speechiness (Log Transform)")
Now, looking at the histograms by K-pop generation, we can see that all are heavily right skewed, just like the overall distribution, but there are some differences between generations. For example, generation 2 has the least speechiness among its tracks, with its highest bar at speechiness levels at or near zero. The 4th generation has a higher center of speechiness than the rest; perhaps more rap is incorporated into 4th-generation releases than in other eras.
Beyond these observations, however, it is difficult to compare the generations at the higher levels of speechiness, so we look at the distribution on the log scale. For the 4th generation, not only is the center of the distribution higher, but the upper tail is shorter than in the 1st and 2nd generations, while keeping a higher concentration of its distribution between -2 and -1 on the log scale. Generation 2 also has a shorter right tail (ending around a log value of -1), and its density from about -2.5 to -1 declines consistently and appears lower than in the other generations.
Generation 1 seems to have the most variation in speechiness, with a center similar to generation 3 but greater spread in the density between -2 and 0 on the log scale; it also reaches the highest speechiness levels. This variation could reflect the experimental nature of first-generation music: this was the time when K-pop was beginning to become the model and style of what we listen to in the modern era, yet many music companies and artists were still trying to define their sound and fit the demands of the market. The generally high levels of speechiness could be due to the heavy rap and hip-hop influence on the genre at the time, which carries more speechiness than generic pop.
by generation
ggbox.group(data$speechiness, "Speechiness")
The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.
tempo distribution
gghist(data$tempo, 'tempo')
As expected, most songs have a tempo between 90 and 160 BPM. Tempos in this range are fast; anything above roughly 120 BPM falls into the allegro range.
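For reference, BPM values can be bucketed into rough classical tempo markings with `cut()`. The boundaries below are approximate and vary by source; this is only a sketch:

```r
# Approximate tempo bands (BPM boundaries vary by source)
bpm <- c(70, 100, 130, 180)   # toy tempo values
cut(bpm,
    breaks = c(0, 76, 108, 120, 168, Inf),
    labels = c("adagio/largo", "andante", "moderato", "allegro", "presto"))
# adagio/largo andante allegro presto
```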
gghist.group(data$tempo, "Tempo")
by generation
ggbox.group(data$tempo, "Tempo")
An estimated overall time signature of a track. The time signature (meter) is a notational convention to specify how many beats are in each bar (or measure).
time_signature distribution (categorical)
ggplot(data, aes(time_signature)) +
  geom_bar(aes(fill = time_signature)) +
  theme_minimal()
ggplot(data, aes(ArtistGen, y = ..count.. / sum(..count..))) +
  geom_bar(aes(fill = time_signature), position = "fill") +
  xlab('Generation') + ylab('Relative Frequency') +
  ggtitle("Time Signature by Generation") +
  #scale_y_continuous(labels = percent_format()) +
  theme_bw()
A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
valence distribution
gghist(data$valence, 'Valence')
Because pop as a genre generally skews upbeat and positive, this distribution, with the majority of the data above 0.50 in valence, is expected. Let's see how the generations of artists compare.
gghist.group(data$valence, "Valence")
Generations 2, 3, and 4 have roughly the same distribution shape, with centers around 0.60. Generation 1's distribution of valence is somewhat similar, but it has a slightly higher center and a longer tail to the left.
by generation
ggbox.group(data$valence, "Valence")
numdata <- select(data, ArtistGen, ArtistDebut, popularity, duration, acousticness,
                  danceability, energy, instrumentalness, key, liveness, loudness,
                  mode, speechiness, tempo, time_signature, valence)
numdata <- data.frame(lapply(numdata, as.numeric))
cor(numdata)
## ArtistGen ArtistDebut popularity duration
## ArtistGen 1.000000000 0.906514067 0.5723777796 -0.210535881
## ArtistDebut 0.906514067 1.000000000 0.5667290513 -0.241016480
## popularity 0.572377780 0.566729051 1.0000000000 -0.170791773
## duration -0.210535881 -0.241016480 -0.1707917732 1.000000000
## acousticness 0.014549816 0.050739753 -0.0090541793 0.124974628
## danceability -0.036179939 -0.054137420 -0.0006195172 -0.232245569
## energy 0.062255397 0.042977198 0.0267298591 -0.178145385
## instrumentalness -0.017406312 -0.022642638 -0.1025662968 -0.057839074
## key -0.013937044 -0.015586264 0.0108900865 -0.017880133
## liveness 0.025161977 0.017494145 0.0070356134 -0.066393658
## loudness 0.236131262 0.266613484 0.2056198602 -0.141228810
## mode -0.012920685 -0.003463798 -0.0469199885 0.090827057
## speechiness 0.064466625 0.045482370 0.1079388233 -0.172208248
## tempo 0.031915743 0.022804079 0.0090505430 -0.016494943
## time_signature -0.005943726 -0.003714702 -0.0187060828 -0.009822828
## valence -0.067696313 -0.071851116 -0.0553622209 -0.225422839
## acousticness danceability energy instrumentalness
## ArtistGen 0.014549816 -0.0361799390 0.06225540 -0.017406312
## ArtistDebut 0.050739753 -0.0541374202 0.04297720 -0.022642638
## popularity -0.009054179 -0.0006195172 0.02672986 -0.102566297
## duration 0.124974628 -0.2322455693 -0.17814539 -0.057839074
## acousticness 1.000000000 -0.3075295937 -0.65765431 0.016754608
## danceability -0.307529594 1.0000000000 0.25074536 -0.051024029
## energy -0.657654309 0.2507453642 1.00000000 -0.075354007
## instrumentalness 0.016754608 -0.0510240285 -0.07535401 1.000000000
## key -0.010257285 0.0239435543 0.01130035 -0.004073038
## liveness -0.090899988 -0.0471270638 0.18009905 -0.016079199
## loudness -0.409504117 0.1445412702 0.70988744 -0.195035866
## mode 0.164557550 -0.1256818785 -0.16571772 0.002112187
## speechiness -0.141270250 0.0433279645 0.20014335 -0.043123417
## tempo -0.086198952 -0.2106224421 0.13425309 -0.007106811
## time_signature -0.145187354 0.1372308130 0.15963495 -0.034589610
## valence -0.345401834 0.5208471914 0.48358863 -0.057618078
## key liveness loudness mode
## ArtistGen -0.013937044 0.025161977 0.236131262 -0.012920685
## ArtistDebut -0.015586264 0.017494145 0.266613484 -0.003463798
## popularity 0.010890087 0.007035613 0.205619860 -0.046919988
## duration -0.017880133 -0.066393658 -0.141228810 0.090827057
## acousticness -0.010257285 -0.090899988 -0.409504117 0.164557550
## danceability 0.023943554 -0.047127064 0.144541270 -0.125681878
## energy 0.011300354 0.180099051 0.709887441 -0.165717724
## instrumentalness -0.004073038 -0.016079199 -0.195035866 0.002112187
## key 1.000000000 0.000551498 0.004219365 -0.187055019
## liveness 0.000551498 1.000000000 0.097565306 -0.042669191
## loudness 0.004219365 0.097565306 1.000000000 -0.088570431
## mode -0.187055019 -0.042669191 -0.088570431 1.000000000
## speechiness 0.027426938 0.095507295 0.051450160 -0.098665395
## tempo -0.007013553 0.024452979 0.110576121 0.005816266
## time_signature 0.006669893 0.013965238 0.126839768 -0.030798285
## valence 0.018697071 0.047556520 0.323900618 -0.138859316
## speechiness tempo time_signature valence
## ArtistGen 0.06446662 0.031915743 -0.005943726 -0.06769631
## ArtistDebut 0.04548237 0.022804079 -0.003714702 -0.07185112
## popularity 0.10793882 0.009050543 -0.018706083 -0.05536222
## duration -0.17220825 -0.016494943 -0.009822828 -0.22542284
## acousticness -0.14127025 -0.086198952 -0.145187354 -0.34540183
## danceability 0.04332796 -0.210622442 0.137230813 0.52084719
## energy 0.20014335 0.134253094 0.159634949 0.48358863
## instrumentalness -0.04312342 -0.007106811 -0.034589610 -0.05761808
## key 0.02742694 -0.007013553 0.006669893 0.01869707
## liveness 0.09550730 0.024452979 0.013965238 0.04755652
## loudness 0.05145016 0.110576121 0.126839768 0.32390062
## mode -0.09866539 0.005816266 -0.030798285 -0.13885932
## speechiness 1.00000000 0.128397309 0.020898562 0.13012512
## tempo 0.12839731 1.000000000 -0.056790035 0.03745051
## time_signature 0.02089856 -0.056790035 1.000000000 0.11368780
## valence 0.13012512 0.037450514 0.113687800 1.00000000
Prior to any data transformations, the only highly correlated pair of variables is Artist Debut and Artist Generation. This multicollinearity will not affect our analysis, since we will not be investigating the relationship between those two variables extensively. The observation is expected, since the concept of K-pop generations is partly defined by when the artist debuted into the K-pop market and the period in which they promoted their music.
The next highest correlation is between energy and loudness, at 0.710. This suggests a moderately strong positive association: as the detected energy level increases, the loudness of the music also increases, and vice versa. The third strongest correlation is -0.658, between energy and acousticness. This moderate negative association indicates that an increase in a song's energy tends to correspond with a decrease in its acousticness, and vice versa.
Among the moderate positive associations we can observe the following relationships:
Popularity and Artist Generation: 0.572. Since the Spotify algorithm weights recent plays in its popularity score, some association between the score and when a song was actively promoted is to be expected. This positive association means that the higher the popularity, the later the generation we should expect (gen 4 rather than gen 1). What is surprising is that the correlation between popularity and a song's K-pop generation is not stronger, which shows that many of the older songs are still actively listened to today.
Danceability and valence: 0.521. This positive relationship can be interpreted as: the higher the danceability, the higher the expected valence (happier rather than sadder mood). The relationship is reasonable, since one would naturally be more drawn to dance to a happier song.
Energy and valence: 0.484.
transformdata <- mutate(numdata,
                        speechiness = log(speechiness + 0.001),
                        instrumentalness = log(instrumentalness + 0.001))
cor(transformdata)
## ArtistGen ArtistDebut popularity duration
## ArtistGen 1.000000000 0.906514067 0.5723777796 -0.210535881
## ArtistDebut 0.906514067 1.000000000 0.5667290513 -0.241016480
## popularity 0.572377780 0.566729051 1.0000000000 -0.170791773
## duration -0.210535881 -0.241016480 -0.1707917732 1.000000000
## acousticness 0.014549816 0.050739753 -0.0090541793 0.124974628
## danceability -0.036179939 -0.054137420 -0.0006195172 -0.232245569
## energy 0.062255397 0.042977198 0.0267298591 -0.178145385
## instrumentalness -0.076223918 -0.091292205 -0.1238417302 -0.078308526
## key -0.013937044 -0.015586264 0.0108900865 -0.017880133
## liveness 0.025161977 0.017494145 0.0070356134 -0.066393658
## loudness 0.236131262 0.266613484 0.2056198602 -0.141228810
## mode -0.012920685 -0.003463798 -0.0469199885 0.090827057
## speechiness 0.090656116 0.072291720 0.1288500472 -0.226029594
## tempo 0.031915743 0.022804079 0.0090505430 -0.016494943
## time_signature -0.005943726 -0.003714702 -0.0187060828 -0.009822828
## valence -0.067696313 -0.071851116 -0.0553622209 -0.225422839
## acousticness danceability energy instrumentalness
## ArtistGen 0.014549816 -0.0361799390 0.06225540 -0.076223918
## ArtistDebut 0.050739753 -0.0541374202 0.04297720 -0.091292205
## popularity -0.009054179 -0.0006195172 0.02672986 -0.123841730
## duration 0.124974628 -0.2322455693 -0.17814539 -0.078308526
## acousticness 1.000000000 -0.3075295937 -0.65765431 -0.034549683
## danceability -0.307529594 1.0000000000 0.25074536 -0.006192921
## energy -0.657654309 0.2507453642 1.00000000 -0.041148731
## instrumentalness -0.034549683 -0.0061929212 -0.04114873 1.000000000
## key -0.010257285 0.0239435543 0.01130035 0.012250982
## liveness -0.090899988 -0.0471270638 0.18009905 -0.013443060
## loudness -0.409504117 0.1445412702 0.70988744 -0.211902722
## mode 0.164557550 -0.1256818785 -0.16571772 -0.013206491
## speechiness -0.244083790 0.1214381664 0.33506737 -0.059728105
## tempo -0.086198952 -0.2106224421 0.13425309 0.008432307
## time_signature -0.145187354 0.1372308130 0.15963495 -0.046103671
## valence -0.345401834 0.5208471914 0.48358863 -0.036863141
## key liveness loudness mode
## ArtistGen -0.013937044 0.025161977 0.236131262 -0.012920685
## ArtistDebut -0.015586264 0.017494145 0.266613484 -0.003463798
## popularity 0.010890087 0.007035613 0.205619860 -0.046919988
## duration -0.017880133 -0.066393658 -0.141228810 0.090827057
## acousticness -0.010257285 -0.090899988 -0.409504117 0.164557550
## danceability 0.023943554 -0.047127064 0.144541270 -0.125681878
## energy 0.011300354 0.180099051 0.709887441 -0.165717724
## instrumentalness 0.012250982 -0.013443060 -0.211902722 -0.013206491
## key 1.000000000 0.000551498 0.004219365 -0.187055019
## liveness 0.000551498 1.000000000 0.097565306 -0.042669191
## loudness 0.004219365 0.097565306 1.000000000 -0.088570431
## mode -0.187055019 -0.042669191 -0.088570431 1.000000000
## speechiness 0.028510091 0.101868205 0.158330805 -0.142517812
## tempo -0.007013553 0.024452979 0.110576121 0.005816266
## time_signature 0.006669893 0.013965238 0.126839768 -0.030798285
## valence 0.018697071 0.047556520 0.323900618 -0.138859316
## speechiness tempo time_signature valence
## ArtistGen 0.09065612 0.031915743 -0.005943726 -0.06769631
## ArtistDebut 0.07229172 0.022804079 -0.003714702 -0.07185112
## popularity 0.12885005 0.009050543 -0.018706083 -0.05536222
## duration -0.22602959 -0.016494943 -0.009822828 -0.22542284
## acousticness -0.24408379 -0.086198952 -0.145187354 -0.34540183
## danceability 0.12143817 -0.210622442 0.137230813 0.52084719
## energy 0.33506737 0.134253094 0.159634949 0.48358863
## instrumentalness -0.05972810 0.008432307 -0.046103671 -0.03686314
## key 0.02851009 -0.007013553 0.006669893 0.01869707
## liveness 0.10186820 0.024452979 0.013965238 0.04755652
## loudness 0.15833081 0.110576121 0.126839768 0.32390062
## mode -0.14251781 0.005816266 -0.030798285 -0.13885932
## speechiness 1.00000000 0.134909314 0.058753696 0.22084061
## tempo 0.13490931 1.000000000 -0.056790035 0.03745051
## time_signature 0.05875370 -0.056790035 1.000000000 0.11368780
## valence 0.22084061 0.037450514 0.113687800 1.00000000
No major changes in the correlations: after the transform there are no new strong correlations between instrumentalness or speechiness and the other variables.